Using eye-tracking methodology, gaze to a speaking face was compared in a group of 
children with autism spectrum disorders (ASD) and a group with typical development (TD). 
Patterns of gaze were observed under three conditions: audiovisual (AV) speech in auditory 
noise, visual only speech and an AV non-face, non-speech control. Children with ASD 
looked less to the face of the speaker and fixated less on the speaker's mouth than TD 
controls. No differences in gaze were reported for the non-face, non-speech control task. 
Since the mouth holds much of the articulatory information available on the face, these 
findings suggest that children with ASD may have reduced access to critical linguistic 
information. This reduced access to visible articulatory information could be a contributor 
to the communication and language problems exhibited by children with ASD. 

Keywords: autism spectrum disorders, audiovisual speech perception, eyetracking, communication development, 
speech in noise, lipreading 



INTRODUCTION 

Autism spectrum disorders (ASD) refer to neurodevelopmental disorders along a continuum of severity that are generally characterized by marked deficits in social and communicative functioning (American Psychiatric Association, 2000). A feature of the social deficits associated with ASD is facial gaze avoidance and reduced eye contact with others in social situations (Hutt and Ounstead, 1966; Hobson et al., 1988; Volkmar et al., 1989; Volkmar and Mayes, 1990; Phillips et al., 1992). One implication of this reduced gaze to others' faces is a potential difference in face processing. A number of studies have suggested that individuals with ASD show differences in face processing, including impaired face discrimination and recognition (for a review see Dawson et al., 2005, but see Jemel et al., 2006 for evidence that face processing abilities are stronger in ASD than previously reported) and identification of emotion (Pelphrey et al., 2002).

Along with identity and affective information, the face provides valuable information about a talker's articulations. Visible speech information influences what typically developing listeners hear (e.g., increases identification in the presence of auditory noise, Sumby and Pollack, 1954) and is known to facilitate language processing (McGurk and MacDonald, 1976; MacDonald and McGurk, 1978; Reisberg et al., 1987; Desjardins et al., 1997; MacDonald et al., 2000; Lachs and Pisoni, 2004). Further, typical speech and language development is thought to take place in an audiovisual (AV) context (Meltzoff and Kuhl, 1994; Desjardins et al., 1997; Lachs et al., 2001; Bergeson and Pisoni, 2004). Thus, differences in access to visible speech information would have significant consequences for a perceiver. For example, there is evidence that the production of speech differs in blind versus sighted individuals (sighted speakers produce vowels further apart in articulatory space than those of blind speakers, ostensibly because of their access to visible contrasts; Menard et al., 2009), suggesting that speech perception and production are influenced by experience with the speaking face.

Consistent with their difficulties with information on faces, a growing body of literature indicates that children with ASD are less influenced by visible speech information than TD controls (De Gelder et al., 1991; Massaro and Bosseler, 2003; Williams et al., 2004; Mongillo et al., 2008; Iarocci et al., 2010; Irwin et al., 2011, but see Iarocci and McDonald, 2006 and Woynaroski et al., 2013). In particular, children and adolescents with ASD appear to benefit less from the visible articulatory information on the speaker's face in the context of auditory noise (Smith and Bennetto, 2007; Irwin et al., 2011). Further, children with ASD have been reported to be particularly poor at lipreading (Massaro and Bosseler, 2003).

Although avoidance of gaze to others' faces has been noted clinically, the exact nature of gaze patterns to faces in ASD has been a topic of investigation. A varied body of research using eye-tracking methodology has examined facial gaze patterns in individuals with ASD, in particular with complex social situations and with affective stimuli. A number of studies find that individuals with ASD differ in the amount of fixations to the eye region of the face when compared to typically developing (TD) controls (Klin et al., 2002; Pelphrey et al., 2002; Dalton et al., 2005; Boraston and Blakemore, 2007; Speer et al., 2007; Kleinhans et al., 2008; Sterling et al., 2008). In particular, during affective or emotion-based tasks, individuals with ASD have been reported to spend significantly more time looking at the mouth (Klin et al., 2002; Neumann et al., 2006; Spezio et al., 2007). However, a recent review by Falck-Ytter and von Hofsten (2011) calls into question whether individuals with ASD look less to the eyes and more to the mouth when gazing at faces; they argue that only limited support exists for this in adults and even less evidence in children. Apart from gaze to eyes and mouth, some studies show increased gaze at "non-core" features (e.g., regions other than the eyes, nose, and mouth) of the face by individuals with ASD compared to TD controls, when gazing at facial expressions of emotion (Pelphrey et al., 2002). Reports of differences in patterns of gaze to faces are not unequivocal, however, with a number of studies reporting no group differences in certain tasks (Adolphs et al., 2001; Speer et al., 2007; Kleinhans et al., 2008). Further, when assessing gaze to a face, pattern of gaze may be a function of both language skill and development. Norbury et al. (2009) report that pattern of gaze to the mouth is associated with communicative competence in ASD. Reported differences in gaze to faces in children with ASD appear to vary depending on the age of the child (Dawson et al., 2005; Chawarska and Shic, 2009; Senju and Johnson, 2009). Moreover, recent work by Foxe et al. (2013) suggests that multisensory integration deficits present in children with ASD may resolve in adulthood (although subtle differences may persist; Saalasti et al., 2012).

Critically, little is known about gaze to the face during speech perception tasks. A question that arises is whether the previously reported deficit in visual speech processing in children with ASD might simply be a consequence of a failure to fixate on the face. However, recent findings by Irwin et al. (2011) provide evidence against this possibility. Irwin et al. (2011) tested children with ASD and matched TD peers on a set of AV speech perception tasks while concurrently recording eye fixation patterns. The tasks included a speech-in-noise task with auditory-only (static face) and AV syllables (to measure the improvement in perceptual identification with the addition of visual information), a McGurk task (with mismatched auditory and visual stimuli), and a visual-only (speechreading) task. Crucially, Irwin et al. (2011) excluded all trials where the participant did not fixate on the speaker's face. They found that even when fixated on the speaker's face, children with ASD were less influenced by visible articulatory information than their TD peers, both in the speech-in-noise tasks and with AV mismatched (McGurk) stimuli. Moreover, the children with ASD were less accurate at identifying visual-only syllables than the TD peers (although their overall speechreading accuracy was fairly high).

Irwin et al.'s (2011) findings indicate that fixation on the face is not sufficient to support efficient AV speech perception. This could suggest differences in how visual speech information is processed in individuals with ASD. However, it could also be due to different gaze patterns on a face exhibited by individuals with ASD. Perhaps if they tend to fixate on different regions of the face than TD individuals, individuals with ASD have reduced access to critical visual information. Consistent with this possibility is evidence that attentional factors can modulate visual influences in speech perception in typical adults; visual influence is reduced when perceivers are asked to attend to a distractor stimulus on the speaker's face (Alsius et al., 2005). Typically developing adults have been shown to increase gaze to the mouth area of the speaker as intelligibility decreases during AV speech tasks (Yi et al., 2013). Further, Buchan et al. (2007) report that typically developing adults gaze to a central area on the face in the presence of AV speech in noise, reducing the frequency of gaze fixations on the eyes and increasing gaze fixations to the nose and the mouth. If children with ASD do not have access to the same visible articulatory information as the TD controls because their gaze patterns differ, this may influence their perception of a speaker's message.

To assess whether there are differences in gaze that underlie the AV speech perception differences in children with ASD as compared to children with typical development, for the present paper we conducted a detailed analysis of the eye-gaze patterns for the participants and tasks reported in Irwin et al. (2011). In particular, we examined patterns of gaze to a speaking face under perceptual conditions where there is an incentive to look at the face: (1) in the presence of auditory noise and (2) where no auditory signal is present (speechreading). We tested whether children with ASD differ from TD controls not only in overall time spent on the face, but also in the relative amount of time spent fixating on the mouth and non-focal regions. We further examined whether the two groups differ in the time-course of eye-gaze patterns to these regions over the course of a speech syllable. Given that the children with ASD in this sample exhibited poorer use of visual speech information than the TD controls in perceptual measures (both for visual-only and AV speech), the analyses reported here may shed some light on the basis for these differences: Is reduced use of visual speech information in perception associated with differences in patterns of fixation on the talking face?

Finally, as a control for the possibility that there are more general group differences in gaze pattern unrelated to faces, we also analyzed gaze patterns in a control condition with dynamic AV non-face, non-speech stimuli.

MATERIALS AND METHODS

PARTICIPANTS

Participants in the current study were 20 native English speaking monolingual children, 10 with ASD (eight boys, mean age 10.2 years, age range 5.58-15.9 years) and 10 TD controls (eight boys, mean age 9.6 years, age range 7-12.6 years). Because the speech conditions in this study required the child participants to report what the speaker said, all participants in this study were verbal. All child participants were reported by parents to have normal or corrected-to-normal hearing and vision. The TD participants had no history of developmental delays including vision, hearing, speech or language problems, by parent report.

The TD controls were matched with the child ASD participants on sex, age, cognitive functioning and language skill. The TD controls were taken from a larger set of children participating in a study of speech perception (n = 80). In addition, the primary caregivers of children with ASD completed a diagnostic interview [autism diagnostic interview-revised (ADI-R), Lord et al., 1994] about their children (n = 10 adult females).

Prior to their participation in the study, child participants with ASD received a diagnosis from a licensed clinician. Four participants had a diagnosis of autism, four of Asperger syndrome and two were diagnosed with pervasive developmental disorder not otherwise specified (PDD-NOS); these diagnoses all fall within the classification of ASD. For characterization purposes, participants with ASD were also assessed with the autism diagnostic observation schedule (ADOS; Lord et al., 2000), and their caregivers (n = 10) were interviewed with the ADI-R (Lord et al., 1994). All participants with ASD met or exceeded cutoff scores for autism spectrum or autism proper on the ADOS algorithm. Scores obtained from caregiver interviews showed that the children with ASD met or exceeded cutoff criteria on the language/communication, reciprocal social interactions and repetitive behavior/interest domains on the ADI-R. Consistent with the range of clinical diagnoses, there was heterogeneity in the extent of social and communication deficits and presence of restricted and repetitive behavior (for example, scores on the combined communication and social impairment scales in the ADOS ranged from 7 to 20, where 10 is the minimum cutoff score and 22 is the maximum possible score).

The mean age and standard deviations of the child ASD and child TD participants, along with measures of cognitive and language functioning, are presented in Table 1. The measures of cognitive functioning were standardized scores for general conceptual ability (GCA) on the Differential Ability Scales (DAS); the measures of language function were core language index scores (CLI) from the Clinical Evaluation of Language Fundamentals-4 (CELF-4; Semel et al., 2003). Independent-samples t-tests on age, GCA, and CLI did not reveal significant differences between the groups, as shown in Table 1.

The sample included here represents a subset of the participants whose data were reported in Irwin et al. (2011). The data of three children with ASD and one TD control were excluded from the present analyses because they spent too little time fixating on the face to permit statistical analysis. The data of two other TD control participants were also removed due to the removal of their respective matched ASD participants.

Table 1 | Mean age and cognitive and language measures for the children with ASD and TD.

                                    ASD           TD            t
n                                   10            10
Age                                 10.2 (3.1)    9.6 (2.4)     t(18) = -0.51, ns
General conceptual ability (GCA)    92.1 (15.5)   98.9 (15.5)   t(18) = 0.97, ns
Core language index scores (CLI)    87.4 (17.3)   97.8 (15.1)   t(18) = 1.4, ns

GCA and CLI are standardized scores. Standard deviations are in parentheses.

MATERIALS

Stimuli

Speech stimuli. The speech stimuli were created from a recording of the productions of a male, monolingual, native speaker of American English. This speaker was audio- and video-recorded in a recording booth producing a randomized list of the consonant-vowel (CV) syllables /ma/ and /na/. The video was centered on the speaker's face and was framed from just above the top of the speaker's head to just below his chin, and was captured at 640 x 480 pixels. The audio was simultaneously recorded to computer and normalized for amplitude, and then realigned with the video in Final Cut Pro. Two tokens of /ma/ and /na/ were selected as stimuli. The stimuli were trimmed to start with the mouth position at rest, followed by an opening gesture, closing for the consonant, and release of the consonant into the following vowel, and ended with the mouth returning to rest at the end of the syllable. The stimuli were approximately 1500 ms long, with the acoustic onset of the consonant (for the AV stimuli) occurring at around 600 ms; the acoustic portions of the stimuli were approximately 550 ms in duration, on average.

For AV speech in noise, the stimuli were AV stimuli of /ma/ and /na/. Three versions of each stimulus were created by setting the mean level of the syllables at 60 dBA, and then adding pink noise at 70, 75, and 80 dBA to the AV /ma/ and /na/ tokens to create stimuli with a range of signal-to-noise levels from less to more noisy (i.e., -10, -15, and -20 dB S/N, respectively). Noise onset and offset were aligned to the auditory speech syllable onset and offset.

The visual-only (speechreading) stimuli were identical to the AV stimuli, except that the audio channel was removed.

Non-speech control stimuli. The AV non-speech stimuli consisted of a set of figure-eight shapes that increased and decreased in size, paired with sine-wave tones that varied in frequency and amplitude. These stimuli were modeled on the speaker's productions of /ma/ and /na/ but did not look or sound like speech. To create the visual stimulus, we measured the lip aperture in every video frame of the /ma/ and /na/ syllables. We then used the aperture values to drive the size of the figure: when the lips closed the figure was small, and upon consonant release into the vowel the figure expanded (see Figures 1C,D). The auditory stimuli were created by converting the auditory /ma/ and /na/ syllables into sine-wave analogs, which consist of three or four time-varying sinusoids, following the center-frequency and amplitude pattern of the spectral peaks of an utterance (Remez et al., 1981). These sine-wave analogs sound like chirps or tones. Thus, the AV non-speech stimuli retained the temporal dynamics of speech, without looking or sounding like a speaking face (see Figures 1A-E).

Visual tracking methodology

Visual tracking was done with an ASL Model 504 pan/tilt remote tracking system, a remote video-based single eye tracker that uses bright pupil, coaxial illumination to track both pupil and corneal reflections at 120 Hz. To optimize the accuracy of the pupil coordinates obtained by the optical camera, this model has a magnetic head tracking unit that tracks the position of a small magnetic sensor attached to the head of the participant, above their left eye.

Language assessment

Language ability was assessed with the CELF-4 (Semel et al., 2003). The CELF-4 is reliable in assessing the language skills of children in the general population and those with a clinical diagnosis including ASD (Semel et al., 2003).

Cognitive assessment

Cognitive ability was assessed using the Differential Ability Scales (DAS) School Age Cognitive Battery (Elliott, 1991). The DAS provides a GCA score, which assesses verbal ability, non-verbal reasoning ability and spatial ability.
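The group comparisons in Table 1 can be roughly checked from the printed means and standard deviations. The sketch below assumes a pooled-variance (Student) independent-samples t-test, which is consistent with the reported 18 degrees of freedom but is not stated explicitly in the text; the function name `pooled_t` is illustrative only. Because the table values are rounded, the recomputed statistics should only approximate the reported ones.

```python
import math

def pooled_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Independent-samples t (equal variances assumed) from summary statistics.

    Returns (t, df) for the difference mean_b - mean_a.
    """
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / df
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_b - mean_a) / se, df

# Means and SDs as printed in Table 1: (ASD mean, ASD SD, TD mean, TD SD).
TABLE_1 = {
    "Age": (10.2, 3.1, 9.6, 2.4),
    "GCA": (92.1, 15.5, 98.9, 15.5),
    "CLI": (87.4, 17.3, 97.8, 15.1),
}

for name, (m_asd, sd_asd, m_td, sd_td) in TABLE_1.items():
    t, df = pooled_t(m_asd, sd_asd, 10, m_td, sd_td, 10)
    print(f"{name}: t({df}) = {t:.2f}")
```

The sign convention (TD minus ASD) is chosen here so that the age comparison comes out negative and the GCA and CLI comparisons positive, matching the table.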
[Figure 1 image: Panels A through E, taken from time bins 1 to 5 (0-300 ms, 300-600 ms, 600-900 ms, 900-1200 ms, and 1200-1500 ms).]

FIGURE 1 | Sample images of the speaker (top panels) during a 
production of /ma/ and the corresponding non-speech figure-eight 
shapes (lower panels) taken from each time bin. Panels A through E 
illustrate, respectively, the initial rest position (A), opening prior to the 
consonant closing gesture (B), the closure for /m/ (C), peak mouth opening 
for the vowel (D), and the return to rest at the end of the vowel (E). 
ADOS

Children with ASD were assessed with the ADOS generic (ADOS-G). The ADOS is a semi-structured standardized assessment of communication, social interaction, and play/imaginative use of materials for individuals suspected of having an ASD (Lord et al., 2002).

ADI-R 

Caregivers of participants with ASD were given the ADI-R (Lord et al., 1994). The ADI-R is a standardized, semi-structured interview for caregivers of those with an ASD to assess autism symptomatology.

PROCEDURE 

After consent was obtained in accordance with the Yale University School of Medicine, all participants completed the experimental tasks in the eye-tracker. Each participant was placed in front of the monitor, after which calibration of the participant's fixation points in the eye-tracker was completed. Prior to any stimulus presentation for each task, directions appeared on the monitor. These directions were read aloud to the participant by a researcher to ensure that they understood the task. In addition, two practice items were completed with the researcher present to confirm that the participant understood and could complete the task. For all conditions, if participants were unsure, they were asked to guess.

Condition 1: AV speech in noise 

Participants were told that they would see and hear a man saying 
some sounds that were not words and to say out loud what they 
heard. Each of the six stimuli (two different tokens of each /ma/ 
and /na/, at each of the three levels of signal-to-noise ratios) was 
presented four times, for a total of 24 trials in a random sequence. 

Condition 2: visual only (speechreading) 

Participants were told that they would see a man saying some sounds that they would not be able to hear, and then asked to say out loud what they thought the man was saying. Each of the four stimuli (two different tokens of each /ma/ and /na/) was presented five times, for a total of 20 trials in a random sequence.

Condition 3: non-speech control 

For this task, two stimuli were presented in sequence on each trial. 
The paired stimuli were either modeled on different tokens of 
the same syllable (e.g., both /ma/ or both /na/) or on tokens of 
different syllables (one /ma/ and one /na/). Participants were told 
that they would see two shapes that would open and close and 
should say out loud whether the two shapes opened and closed 
in the same way (e.g., both modeled on /ma/ or both modeled 
on /na/, although no reference was made to the speech origins 
of the stimuli to participants) or if the way that they closed was 
different (e.g., one modeled on /ma/ and one on /na/). Each pairing 
was presented seven times, for a total of 28 trials in a random 
sequence. 

The three tasks were blocked and presented in random order. The interstimulus interval for all trials within the blocks was 3 s. After every five trials, participants were presented with a slide of animated shapes and faces, to maintain attention to the task. All audio stimuli were presented at a comfortable listening level (60 dBA) from a centrally located speaker under the eye-tracker, and visual stimuli were presented at 640 x 480 pixels on a video monitor 30 inches from the participant.
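The relation between the 60 dBA speech level and the three signal-to-noise levels used in the speech-in-noise condition (-10, -15, and -20 dB S/N) is simple decibel arithmetic. The following is a minimal sketch of that arithmetic only; the function names are illustrative and it does not reproduce how the stimuli were actually mixed.

```python
SPEECH_LEVEL_DBA = 60.0                  # speech level stated in the text
SNR_LEVELS_DB = (-10.0, -15.0, -20.0)    # S/N levels stated in the text

def noise_level_for_snr(speech_level_db, snr_db):
    """Noise level (dB) that yields the requested S/N for a given speech level."""
    return speech_level_db - snr_db

def power_ratio(snr_db):
    """Linear signal-to-noise power ratio corresponding to an S/N in dB."""
    return 10 ** (snr_db / 10)

for snr in SNR_LEVELS_DB:
    noise = noise_level_for_snr(SPEECH_LEVEL_DBA, snr)
    print(f"S/N {snr:+.0f} dB -> noise at {noise:.0f} dBA, "
          f"signal power is {power_ratio(snr):.3f} x noise power")
```

Under these figures the noisiest condition places the noise 20 dB above the speech, that is, the speech carries one hundredth of the noise power.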

After the experimental procedure participants were tested with 
the battery of cognitive and language assessments and caregivers of 
the ASD participants were interviewed separately with the ADI-R. 

RESULTS 

Participant gaze to the speaker's face was examined by group for the AV speech-in-noise and visual-only (speechreading) trials, as was gaze on the figure-eight shape in non-speech trials. The eye tracker recorded fixation position in x and y coordinates at approximately 8 ms intervals. (In cases where the coordinates were not recorded, the x- and y-coordinates of the previous time point were applied.) Each x-y coordinate was coded according to whether it was on-screen or off-screen, and if it was on-screen, whether it was part of an on-face fixation or not. Off-screen fixations were eliminated from the data.
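The handling of unrecorded samples and off-screen samples described above can be sketched as follows. This is an illustration under assumptions: missing samples are represented as `None`, the display bounds are passed in as hypothetical `width` and `height` parameters (the monitor dimensions are not given in the text), and the function names are invented for the example.

```python
def forward_fill(samples):
    """Replace missing (None) gaze samples with the most recent recorded one.

    Samples before the first recorded point have no predecessor and stay None.
    """
    filled, last = [], None
    for sample in samples:
        if sample is not None:
            last = sample
        filled.append(last)
    return filled

def is_on_screen(sample, width, height):
    """True if an (x, y) sample falls inside a display of the given size."""
    if sample is None:
        return False
    x, y = sample
    return 0 <= x < width and 0 <= y < height

def keep_on_screen(samples, width, height):
    """Forward-fill missing samples, then drop anything off-screen."""
    return [s for s in forward_fill(samples) if is_on_screen(s, width, height)]
```

For instance, `keep_on_screen([(10, 10), None, (700, 10)], 640, 480)` keeps the first sample and its forward-filled copy and drops the third, which lies beyond the assumed 640-pixel width.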

The on-face coordinates were coded according to face regions, namely: forehead, jaw, cheeks, ears, eyes, mouth region (including the spaces between the lower lip and the jaw and between the upper lip and the nose), and nose. The primary regions of interest were the mouth region and a collective set of non-focal regions (face areas other than the mouth region, eyes, and nose), in light of reports that children with ASD spend relatively more time fixating on non-focal regions of the face (Pelphrey et al., 2002). The non-focal regions encompassed the ears, the cheeks, the forehead, and all other regions not otherwise labeled (primarily the space between the eye and the ear, between the nose and cheek, and between the eyes). The jaw area was not included in either the mouth region or the non-focal regions; this is because the jaw, unlike the other non-focal regions, has extensive movement that is time-locked to the speech articulation. Thus, jaw movement conveys information about the kinematics of the speech act.

For the non-speech condition, the on-screen regions were coded in an analogous manner, based on the extent of the figure-eight shape. These regions are described below.

Data points were only included as fixations if they had less than a 40 pixel movement from the previous time point, and occurred within a contiguous 100 ms window of similar small movements that did not cross into a different face region, as defined above. In all, 14.5% of the time steps were eliminated across the AV speech-in-noise and visual-only tasks for being either off-screen, saccades, or blinks. Although the mean percentage of dropped data points was higher for the ASD sample than for the TD sample, the difference was not statistically significant [for AV speech-in-noise, ASD: M = 19.4%, SD = 13.3; TD: M = 11.8%, SD = 7.4; t(18) = 1.60, ns; for visual-only, ASD: M = 17.0%, SD = 12.0; TD: M = 10.0%, SD = 5.3; t(18) = 1.70, ns].
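One possible reading of the fixation criterion above (movement under 40 pixels from the previous sample, within a contiguous run of at least 100 ms that stays in one region) is sketched below. Several details are assumptions rather than facts from the text: movement is measured as Euclidean distance, the sample period is taken as 1000/120 ms, a run is counted from the sample that anchors its first small movement, and the region labels are supplied by the caller.

```python
import math

SAMPLE_MS = 1000 / 120   # 120 Hz tracker, roughly 8 ms per sample
MAX_STEP_PX = 40         # movement threshold from the text
MIN_RUN_MS = 100         # minimum contiguous window from the text

def fixation_mask(xs, ys, regions):
    """Return a list of booleans marking samples counted as fixations."""
    n = len(xs)
    # A sample is "stable" if it moved less than MAX_STEP_PX from the previous
    # sample and stayed in the same region. The first sample has no predecessor.
    stable = [False] * n
    for i in range(1, n):
        step = math.hypot(xs[i] - xs[i - 1], ys[i] - ys[i - 1])
        stable[i] = step < MAX_STEP_PX and regions[i] == regions[i - 1]

    mask = [False] * n
    i = 0
    while i < n:
        if not stable[i]:
            i += 1
            continue
        start = i
        while i < n and stable[i]:
            i += 1
        # The run covers the stable samples [start, i) plus the sample just
        # before them, which is the point the first stable sample was
        # compared against.
        run_start = start - 1
        if (i - run_start) * SAMPLE_MS >= MIN_RUN_MS:
            for j in range(run_start, i):
                mask[j] = True
    return mask
```

With this reading, a brief dwell of three samples (about 25 ms) after a large jump is rejected, while a steady gaze of 15 samples (about 125 ms) in one region is retained.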

The individual time steps were collapsed into 300 ms time bins (0-300 ms, 300-600 ms, 600-900 ms, 900-1200 ms, and 1200-1500 ms); we thus calculated the total amount of time spent in each region within each time bin. These time bin boundaries were selected because they roughly corresponded to visual landmarks in the speech signal. The first bin (0-300 ms) preceded the onset of visible movement; the second bin (300-600 ms) included opening of the mouth prior to the consonant and the initiation of closing (either lips in /ma/ or upward tongue-tip movement in /na/); the third bin (600-900 ms) included the consonantal closure and release, and the final two time bins (900-1200 ms and 1200-1500 ms, respectively) span production of the vowel until the end of the trial (for an image of articulation in each of the time bins paired with the corresponding figure-eight shape, see Figure 1).

As a result, our dependent variables were the mean percentage of time gazing on a given region within a time bin. Time spent fixating on the face was calculated as a percentage of time fixated anywhere on the computer monitor within each time bin. In contrast, time spent fixating on specific face regions (mouth region and non-focal areas) was calculated as a percentage of time spent fixated on the face within each time bin.
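The two kinds of percentage described above use different denominators: on-screen time for the face measure, and on-face time for the region measures. A minimal sketch of that computation for a single trial follows. It assumes evenly spaced samples, so sample counts stand in for time, and the region label strings (`"off_face"`, `"mouth"`, `"other"`, and so on) are hypothetical names chosen for the example.

```python
from collections import Counter

BIN_MS = 300
N_BINS = 5
# Hypothetical labels for the non-focal regions (ears, cheeks, forehead,
# and otherwise unlabeled face areas).
NON_FOCAL = ("forehead", "cheeks", "ears", "other")

def percent_by_bin(times_ms, regions):
    """Per-bin gaze percentages for one trial.

    times_ms: timestamp of each retained on-screen fixation sample.
    regions:  region label per sample; "off_face" marks on-screen samples
              that are not on the face.
    """
    counts = [Counter() for _ in range(N_BINS)]
    for t, region in zip(times_ms, regions):
        b = int(t // BIN_MS)
        if 0 <= b < N_BINS:
            counts[b][region] += 1

    rows = []
    for c in counts:
        on_screen = sum(c.values())
        on_face = on_screen - c["off_face"]
        non_focal = sum(c[r] for r in NON_FOCAL)
        rows.append({
            # Face time is expressed relative to on-screen time.
            "face_pct_of_screen": 100 * on_face / on_screen if on_screen else None,
            # Region time is expressed relative to on-face time.
            "mouth_pct_of_face": 100 * c["mouth"] / on_face if on_face else None,
            "non_focal_pct_of_face": 100 * non_focal / on_face if on_face else None,
        })
    return rows
```

Bins with no usable samples return `None` rather than zero, so that empty bins are not mistaken for bins in which the participant looked elsewhere.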



First, we examined whether there were group differences in the percentage of time spent fixating on the face of the speaker out of time spent fixating on-screen. Figure 2 presents the mean time spent on face by group and time bin separately for the AV speech-in-noise and visual-only tasks. As the figure shows, the ASD group on average spent consistently less time on the face than the TD group in both tasks. A set of 2 (group: ASD, TD) by 5 (time bin: 0-300 ms, 300-600 ms, 600-900 ms, 900-1200 ms, and 1200-1500 ms) mixed factor analyses of variance (ANOVAs) were conducted for AV speech-in-noise and visual-only, respectively. There was a significant main effect of group with less time spent on the face by the ASD group than the TD group for AV speech-in-noise with a marginal effect for visual-only [for AV speech-in-noise, ASD: M = 60.8, SD = 25.0; TD: M = 82.3, SD = 21.9; F(1,18) = 6.31, p = 0.02, $\eta^2_G$ = 0.22; for visual-only, ASD: M = 74.3, SD = 20.7; TD: M = 84.2, SD = 14.9; F(1,18) = 3.39, p = 0.08, $\eta^2_G$ = 0.12]. These mean differences reflect moderate to large effect size estimates (Cohen, 1973; Olejnik and Algina, 2003; Bakeman, 2005). There was also a main effect of time bin in both analyses [AV speech-in-noise: F(4,72) = 26.48, p < 0.0001, $\eta^2_G$ = 0.23; visual-only: F(4,72) = 42.7, p < 0.001, $\eta^2_G$ = 0.41], reflecting a rapid increase in fixations on the face from the first to second bins that leveled off by the third bin. The interaction of group and time was not significant for either task.
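For readers unfamiliar with the design, the degrees of freedom reported above, (1,18) for group and (4,72) for time bin, follow from a balanced mixed design with 2 groups, 10 participants per group, and 5 time bins. The sketch below computes the three F ratios for such a design in plain Python. It is an illustration of the standard split-plot partition, not the authors' analysis code (the software used is not named in the text), and it does not compute the generalized eta squared effect sizes. The `data[g][s][t]` layout is an assumption of the example.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)

def mixed_anova_f(data):
    """F ratios for a balanced one-between, one-within design.

    data[g][s][t]: score of subject s in group g at within-subject level t.
    Returns {effect: (F, (df_effect, df_error))}.
    """
    g_n, s_n, t_n = len(data), len(data[0]), len(data[0][0])
    grand = mean(y for grp in data for subj in grp for y in subj)
    group_m = [mean(y for subj in grp for y in subj) for grp in data]
    time_m = [mean(subj[t] for grp in data for subj in grp) for t in range(t_n)]
    subj_m = [[mean(subj) for subj in grp] for grp in data]
    cell_m = [[mean(subj[t] for subj in grp) for t in range(t_n)] for grp in data]

    ss_group = s_n * t_n * sum((m - grand) ** 2 for m in group_m)
    # Subjects within groups: error term for the between-subjects effect.
    ss_subj = t_n * sum((subj_m[g][s] - group_m[g]) ** 2
                        for g in range(g_n) for s in range(s_n))
    ss_time = g_n * s_n * sum((m - grand) ** 2 for m in time_m)
    ss_inter = s_n * sum((cell_m[g][t] - group_m[g] - time_m[t] + grand) ** 2
                         for g in range(g_n) for t in range(t_n))
    # Time x subjects within groups: error term for within-subjects effects.
    ss_error = sum((data[g][s][t] - subj_m[g][s] - cell_m[g][t] + group_m[g]) ** 2
                   for g in range(g_n) for s in range(s_n) for t in range(t_n))

    df_group, df_time = g_n - 1, t_n - 1
    df_inter = df_group * df_time
    df_subj = g_n * (s_n - 1)
    df_error = df_subj * df_time
    ms_subj, ms_error = ss_subj / df_subj, ss_error / df_error
    return {
        "group": (ss_group / df_group / ms_subj, (df_group, df_subj)),
        "time": (ss_time / df_time / ms_error, (df_time, df_error)),
        "group_x_time": (ss_inter / df_inter / ms_error, (df_inter, df_error)),
    }
```

One useful property of this partition is that, with two groups, the between-subjects F equals the square of a pooled-variance t-test computed on each participant's mean across time bins.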

Next, we examined whether there were group differences in gaze to specific regions on the face. We chose the mouth region and non-focal areas (as defined above) as regions of interest¹. We ran a set of 2 (group: ASD, TD) by 5 (time bin: 0-300 ms, 300-600 ms, 600-900 ms, 900-1200 ms, 1200-1500 ms) ANOVAs on the percentage of time spent in each region of interest out of time spent on the face, with separate analyses for the AV speech-in-noise and visual-only tasks, and separate analyses for the mouth region and non-focal areas. Figure 3 presents the relative percentages of time spent in each region of interest by group and time, separately for the AV speech-in-noise and visual-only tasks.

First, consider the mouth region. There was a significant main
effect of group for both tasks, with a relatively smaller percentage
of time spent on the mouth region for the ASD group than the TD
group [for AV speech-in-noise, ASD: M = 26.0, SD = 24.1; TD:
M = 52.9, SD = 30.8; F(1,18) = 11.25, p < 0.005, $\eta^2_G$ = 0.29; for
visual-only, ASD: M = 35.0, SD = 29.5; TD: M = 56.1, SD = 32.6;
F(1,18) = 4.46, p = 0.05, $\eta^2_G$ = 0.14]. There was also a main
effect of time for both tasks [AV speech-in-noise: F(4,72) = 23.18,
p < 0.0001, $\eta^2_G$ = 0.32; visual-only: F(4,72) = 23.7, p < 0.0001,
$\eta^2_G$ = 0.30], with an overall increase in fixations on the mouth
region from the first to third bins before leveling off. Interestingly,
there was an interaction of group and time bin for AV speech-in-noise
[F(4,72) = 10.06, p < 0.0001, $\eta^2_G$ = 0.17], but not for
visual-only (F < 1). As shown in Figure 3, for AV speech-in-noise,
fixations on the mouth region were similar for the two groups in
the first time bin (0-300 ms, prior to the onset of mouth movement),
but the subsequent increase in mouth region fixations was
much more pronounced for the TD group than the ASD group. In
contrast, in the visual-only task the two groups' trajectories across
time were similar, differing in overall percentage of time in the
mouth region.

1 In addition to the analyses of the mouth region and non-focal regions, we also
conducted statistical analyses of fixations on other major face areas, namely the eyes
and nose. However, each involved few fixations overall and the analyses did not
reveal reliable differences between groups; thus, they are not reported here.

FIGURE 2 | Mean time spent on the face region as a percentage of
time spent on-screen for each of the time bins and for the ASD
group (closed circles) and the TD group (open squares). The left and
right panels present results for AV speech in noise and visual-only,
respectively. Error bars represent standard errors, calculated
independently for each time bin.

FIGURE 3 | Mean time spent on the mouth region (solid lines) and
non-focal areas (dashed lines) as a percentage of time spent on the face
for each of the time bins and for the ASD group (closed circles) and the
TD group (open squares). The left and right panels present results for AV
speech in noise and visual-only, respectively. Error bars represent standard
errors, calculated independently for each time bin.
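As an illustration of how the plotted values in Figures 2 and 3 could be derived, the following minimal Python sketch computes a group mean and standard error independently for each time bin. The function name and the example numbers are hypothetical and are not taken from the study's data or analysis scripts; the sketch assumes one percentage score per participant per bin.

```python
import math
from statistics import mean, stdev


def mean_and_se_per_bin(scores):
    """Return a (mean, standard error) pair for each time bin.

    scores: one list per participant, holding one percentage per time bin.
    Each bin is summarized on its own, as in the figure captions.
    """
    n_bins = len(scores[0])
    summary = []
    for b in range(n_bins):
        column = [participant[b] for participant in scores]
        # Standard error of the mean: sample SD divided by sqrt(n).
        se = stdev(column) / math.sqrt(len(column))
        summary.append((mean(column), se))
    return summary


# Hypothetical scores: 3 participants x 5 time bins (illustration only).
example_scores = [
    [10.0, 30.0, 50.0, 55.0, 52.0],
    [20.0, 40.0, 60.0, 58.0, 61.0],
    [30.0, 50.0, 70.0, 61.0, 64.0],
]
example_summary = mean_and_se_per_bin(example_scores)
```

Because each bin is summarized separately, the error bars describe between-participant variability within a bin and do not reflect the repeated-measures structure across bins.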

Next, consider the non-focal regions. For AV speech-in-noise,
there was a significant main effect of group, with a relatively
higher percentage of time spent fixating on non-focal regions by
the ASD group than the TD group [ASD: M = 19.5, SD = 19.6;
TD: M = 7.3, SD = 10.5; F(1,18) = 6.48, p < 0.05, $\eta^2_G$ = 0.15].
There was not a significant main effect of time, F(4,72) = 1.11,
ns, but there was a significant interaction of group and time,
F(4,72) = 4.98, p < 0.005, $\eta^2_G$ = 0.12. Time spent on non-focal
regions was similar for the two groups in the first time bin, but
dropped off rapidly for the TD group while remaining relatively
frequent for the ASD group across the whole trial. For visual-only,
there was again a main effect of group [ASD: M = 17.3, SD = 16.9;
TD: M = 9.2, SD = 12.6; F(1,18) = 5.43, p < 0.05, $\eta^2_G$ = 0.11],
along with a significant main effect of time, F(4,72) = 17.64,
p < 0.0001, $\eta^2_G$ = 0.37, with a decrease in time spent on non-focal
regions from the first time bin to the subsequent bins. The
interaction of group and time (F < 1) was not statistically significant
in the visual-only task.²



2 We initially considered the jaw as a non-focal region, but removed it from the
category because of its extensive movement during the speech event (thus providing
information about the kinematics of the speech act), which distinguished it
from other non-focal areas. However, we did repeat the analyses of the non-focal
regions with the jaw included. This inclusion did not change the outcome for AV
speech-in-noise, but it did for visual-only. In the visual-only task, there were
considerably more fixations in the jaw region by the TD participants than the ASD
participants (although, in an analysis of just fixations on the jaw, the difference was
not statistically reliable). As a result, including jaw in the non-focal category had the
effect of eliminating the statistically significant group difference in non-focal
fixations. However, this obscures an interesting difference between the groups: The ASD
group spent relatively more time fixating on face areas that convey less information
about the kinematics of the speech articulations (e.g., the cheeks).

FIGURE 4 | Mean time spent on the figure-eight shape region as a
percentage of time spent on-screen for each of the time bins and for
the ASD group (closed circles) and the TD group (open squares). Error
bars represent standard errors, calculated independently for each time bin.



The results in the speech tasks can be summarized as follows. 
First, the ASD group spent, on average, less time gazing at the face
than the TD group, and this difference was more pronounced in
the AV speech-in-noise task than in the visual-only task. Second,
when fixating on the face, the ASD group spent relatively less time
fixating on the mouth region than the TD group, and relatively
more time fixating on non-focal regions. Finally, the two groups
differed in their pattern of fixations on the speaking face over
the course of a trial. Specifically, the TD group exhibited a pattern
of initially looking at non-focal regions but then shifting to the 
mouth as the articulation unfolded. The ASD group had a similar 
but reduced shift in the visual-only task, but did not exhibit this 
shift in the AV speech-in-noise task. 

NON-SPEECH CONTROL CONDITIONS 

Finally, to assess whether there were group differences in gaze to
the non-speech stimuli, a series of independent 2 (group: ASD,
TD) × 5 (time bins: 0-300 ms, 300-600 ms, 600-900 ms,
900-1200 ms, and 1200-1500 ms) ANOVAs were run on fixations to
the figure-eight shapes during time spent on screen. The earliest
time bin (0-300 ms) encompasses the pre-movement period; in the
next time bin (300-600 ms) the shape increases to its maximum size;
in the third time bin (600-900 ms) it shrinks from maximum size to
minimum size; and in the final two time bins (900-1200 ms,
1200-1500 ms) it increases again until the end of the trial (see
Figure 1). We defined two regions of
interest: a narrow region encompassing an area around the outline
of the figure-eight shape at its smallest point (see Figure 1C),
and a broad region encompassing the area around the outline
of the shape at its largest point (see Figure 1D). We analyzed
percentage of trials with fixations in each region at the previously
defined time samples that incorporated the shape's transition from
a small outline to a large one. The percentage of time spent in the
broad region, shown in Figure 4, had a main effect of time bin
[F(4,72) = 12.33, p < 0.0001, $\eta^2_G$ = 0.13], due to an increase from
the first bin (prior to movement) to the second, but no main effect
of group [F(1,18) = 1.09, ns] and no interaction of group and
time bin (F < 1). The percentage of time in the narrow region
also had a main effect of time bin [F(4,72) = 8.32, p < 0.001,
$\eta^2_G$ = 0.14], with less time in the narrow region in the first bin (prior
to movement) and in the last two bins (when the shape was larger),
but again with no main effect of group (F < 1) and no interaction
of group and time bin [F(4,72) = 1.10, ns]. Overall, the TD and
ASD groups exhibited similar gaze patterns with the non-speech
stimuli.
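
The region-by-time-bin measure used in these analyses can be illustrated with a minimal Python sketch. It assigns gaze samples to the five 300 ms bins and reports, for each bin, the percentage of on-screen samples that fall inside a region of interest. Everything beyond the bin boundaries is an assumption made for illustration: the function names, the rectangular regions, the pixel coordinates, and a fixed sampling rate (so that sample counts stand in for time) are hypothetical and do not describe the study's actual software or region definitions.

```python
from bisect import bisect_right

# Bin boundaries in ms, matching the five time bins described in the text.
BIN_EDGES_MS = [0, 300, 600, 900, 1200, 1500]


def rectangle(left, top, right, bottom):
    """Return a predicate testing whether (x, y) lies inside a rectangle."""
    return lambda x, y: left <= x <= right and top <= y <= bottom


def percent_in_region(samples, in_region, in_reference):
    """Percentage of reference samples that fall in the region, per bin.

    samples: iterable of (t_ms, x, y) gaze samples at a fixed sampling rate.
    in_region, in_reference: predicates (x, y) -> bool. The reference is the
    denominator area (e.g., the screen, or the face for face-relative scores).
    Bins with no reference samples yield None.
    """
    n_bins = len(BIN_EDGES_MS) - 1
    region_counts = [0] * n_bins
    reference_counts = [0] * n_bins
    for t_ms, x, y in samples:
        if not (BIN_EDGES_MS[0] <= t_ms < BIN_EDGES_MS[-1]):
            continue  # sample lies outside the analyzed trial window
        b = bisect_right(BIN_EDGES_MS, t_ms) - 1
        if in_reference(x, y):
            reference_counts[b] += 1
            if in_region(x, y):
                region_counts[b] += 1
    return [
        100.0 * r / ref if ref else None
        for r, ref in zip(region_counts, reference_counts)
    ]


# Hypothetical screen and region of interest, in pixels.
screen = rectangle(0, 0, 1024, 768)
region = rectangle(400, 300, 600, 500)

# Hypothetical trial: gaze starts outside the region, then moves into it.
samples = [(t, 100, 100) if t < 300 else (t, 500, 400)
           for t in range(0, 1500, 100)]
result = percent_in_region(samples, region, screen)
```

Passing a face region as the reference instead of the screen would give scores expressed as a percentage of time on the face, as in the mouth and non-focal analyses.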



DISCUSSION 

The current study examined patterns of gaze to a speaking face
by children with ASD and a set of well-matched TD controls.
Gaze was examined under conditions that create a strong incentive
to attend to the speaker's articulations, namely, AV speech with
background noise and visual-only (speechread) speech. We found
differences in the gaze patterns of children with ASD relative to 
their TD peers, which could impact their ability to obtain visible 
articulatory information. 

The findings indicated that children with ASD spent significantly
less time gazing to a speaking face than the TD controls,
which is consistent with diagnostic criteria for this disorder and
findings from previous research (Hutt and Ounstead, 1966;
Hobson et al., 1988; Volkmar et al., 1989; Volkmar and Mayes, 1990;
Phillips et al., 1992). The reduction in gaze to the face of the
speaker was greater in the speech in noise than the visual-only
condition. This suggests that children with ASD gaze at the face of
the speaker when the task requires it, as in speechreading. This is
perhaps consistent with the finding that the difference in perceptual
performance between the ASD and TD groups (Irwin et al.,
2011) was less pronounced in the visual-only condition than with
speech in noise.

Importantly, when fixated on the face of the speaker, the children
with ASD were significantly less likely to gaze at the speaker's
mouth than the TD children in the context of both speech in
noise and speechreading. This finding might appear to conflict
with previous findings of increased gaze to the mouth by individuals
with ASD in comparison to TD controls (e.g., Klin et al.,
2002; Neumann et al., 2006; Spezio et al., 2007). However, this
disparity may arise from the specific demands of the respective
tasks. Findings of increased gaze on the mouth by children with
ASD have typically occurred when the task required emotional
or social judgments and when the mouth was not the primary
source of the relevant information. In contrast, our study involved
a speech perception task, so the mouth was the primary source
of relevant (articulatory) information. These findings in tandem
suggest that children with ASD paradoxically may be less likely
to attend to the mouth when it carries greater informational
value.

Instead of gazing at the mouth during the speech in noise task, 
the children with ASD tended to spend more time directing their 
gaze to non-focal areas of the face (also see Pelphrey et al., 2002).
Non-focal areas such as the ears, cheeks, and forehead carry little, 
if any, articulatory information. For speech in noise, as the speaker 
began to produce the articulatory signal, the TD children looked 
more to the mouth than did the children with ASD, who continued 
to gaze at non-focal regions. 

Notably, the group differences were less prominent in the
visual-only condition, where visual phonetic information on the
mouth is fundamental to the task (in contrast to the speech-in-noise
task, where there is an auditory speech signal). In this case,
the two groups exhibited a similar pattern of shifting from non-focal
areas to the mouth region as the speaker began to produce
the syllable, even though the ASD group overall spent relatively
less time on the mouth and more time on non-focal regions than
the TD controls. This finding suggests that children with ASD may
be able to approximate a similar pattern of gaze to areas of the face
that hold important articulatory information when it is required
by the task.

Finally, there were no significant differences by group in pattern
of gaze for the non-speech, non-face control condition. This suggests
that the differences in gaze patterns between children with
ASD and TD do not necessarily occur for all AV stimuli, and is
consistent with the notion that these differences are specific to
speaking faces.

In the Introduction, we outlined two possible reasons for
why children with ASD are less influenced by visual speech
information than their TD peers, even when they are fixated
on the face (Irwin et al., 2011), namely, that they have
an impairment in AV speech processing, or that they have
reduced access to critical visual information. The present results
do not address the question of a processing impairment, but
they do offer insight into the issue of access to speech information.
Because the mouth is the source of phonetically relevant
articulatory information available on the face (Thomas
and Jordan, 2004), reduced gaze to the mouth limits access to
that information, and our results may therefore help account for
the language and communication difficulties exhibited by children
with ASD.

To summarize, even with a sample of verbal children who were
closely matched in language and cognition to controls, we found
differences in pattern of gaze to a speaking face between children
with ASD and TD controls. However, these findings should
be interpreted with caution, given the small sample size, broad
age range and varied diagnostic category. Future research should
be conducted to assess how differences in each of these variables
impact pattern of gaze. In particular, an interesting question is
whether pattern of gaze relates to communicative skill (e.g., as
in Norbury et al., 2009; also see Falck-Ytter et al., 2012). A larger
sample would allow for examination of this relationship. Further,
the speech stimuli in the current study were consonant-vowel
speech syllables; future research should also examine sentence-level
connected speech.

Finally, future work should consider the possible implications
of the results for intervention. Our results in the speech-in-noise
task indicate that children with ASD may not spontaneously look
to critical areas of a speaking face in the presence of background
noise, even though it would improve comprehension. This is
particularly problematic in light of findings that auditory noise is
especially disruptive for individuals with ASD in speech perception
(Alcantara et al., 2004). However, the results in the visual-only
speechreading task, where children with ASD did tend to shift their
gaze from non-focal areas to the mouth (albeit to a lesser degree
than the TD controls), suggest that children with ASD can show
more typical gaze patterns when necessary. Therefore, intervention
to help individuals with ASD to gain greater access to visible
articulatory information may be useful, with the goal of increased
communicative functioning in the natural listening and speaking
environment.